House Price - Data Prep

1 Objectives

  • Number of variables: 81
  • Variable types
      • Integer or discrete
      • Numeric or double
      • Categorical
      • Qualitative
  • Data quality
      • Number of NAs per variable
  • Creation of new variables, if needed
  • Transformation of variables, if needed

2 Data Set

2.1 Training data

# Read table 
df.train <- data.table::fread('../dados/train.csv', 
                              sep=",", 
                              showProgress = FALSE) %>% 
            data.frame(stringsAsFactors = F)
df.train

3 How many data types are there?

lapply(df.train,class) %>%
  unlist %>% as.character() %>% table
## .
## character   integer 
##        43        38

3.1 Integer data

df.int <- df.train[,unlist(lapply(df.train,is.integer))]
df.int

4 Data summary

skim_to_wide(df.int) %>% arrange(desc(missing))

Note that there are null values in the sample. My strategy will be to remove variables that have many null values and to impute data for the variables with few null values.
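This strategy can be sketched as a simple threshold rule (a sketch only; the 0.5 cutoff is an illustrative assumption, not a value used in this analysis):

```r
# Sketch: split variables by their share of missing values.
# The 0.5 threshold is illustrative only.
na_share <- sapply(df.int, function(x) mean(is.na(x)))
vars_to_drop   <- names(na_share[na_share > 0.5])                  # too many NAs: remove
vars_to_impute <- names(na_share[na_share > 0 & na_share <= 0.5])  # few NAs: impute
```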

4.1 Pre-selection of the integer variables

set.seed(2510)
boruta.int <- Boruta(SalePrice ~ ., data=na.omit(df.int[,-1]), doTrace=2)  # perform Boruta search
cols.int <- names(boruta.int$finalDecision[boruta.int$finalDecision %in% c("Confirmed", "Tentative")])  # collect

4.2 Variable importance plot

plot(boruta.int, cex.axis=.7, las=2, xlab="", main="Variable Importance")

5 Selecting the integer variables

df.int <- df.int[,c('Id',"SalePrice",cols.int)]

6 Data quality

6.1 Integer type: % of null values

apply(is.na(df.int),2,
            function(x) round(100*sum(as.numeric(x))/length(x),2)) %>% 
            sort(decreasing = T)
##   LotFrontage   GarageYrBlt    MasVnrArea            Id     SalePrice 
##         17.74          5.55          0.55          0.00          0.00 
##    MSSubClass       LotArea   OverallQual   OverallCond     YearBuilt 
##          0.00          0.00          0.00          0.00          0.00 
##  YearRemodAdd    BsmtFinSF1     BsmtUnfSF   TotalBsmtSF     X1stFlrSF 
##          0.00          0.00          0.00          0.00          0.00 
##     X2ndFlrSF     GrLivArea  BsmtFullBath      FullBath      HalfBath 
##          0.00          0.00          0.00          0.00          0.00 
##  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces    GarageCars 
##          0.00          0.00          0.00          0.00          0.00 
##    GarageArea    WoodDeckSF   OpenPorchSF EnclosedPorch   ScreenPorch 
##          0.00          0.00          0.00          0.00          0.00

Overall, we can conclude that there are not many null values among the integer variables: only LotFrontage has a substantial share of missing values, about 17.74%, and since several other variables are more important than it, I decided to remove it.

6.2 Removing the LotFrontage variable

df.int <- df.int %>% select(-LotFrontage)

6.3 Imputing data with kNN

Building a kNN model to impute the data.

preProcess_missingdata_model <- preProcess(df.int[,-c(1,2)], method='knnImpute')
preProcess_missingdata_model
## Created from 1371 samples and 27 variables
## 
## Pre-processing:
##   - centered (27)
##   - ignored (0)
##   - 5 nearest neighbor imputation (27)
##   - scaled (27)

We will now use this model to predict the missing values in df.int.

df.int1 <- predict(preProcess_missingdata_model, newdata = df.int[,-c(1,2)] )
anyNA(df.int1)
## [1] FALSE

All missing values were imputed successfully; now let's join the Id and SalePrice columns back to df.int1.

df.int <- cbind(df.int[,c(1,2)],df.int1)

Converting string data to a categorical type.

df.fac <- df.train[,unlist(lapply(df.train,is.character))] %>%
          apply(2,as.factor) %>% 
          data.frame()

6.4 Categorical type: % of null values

apply(is.na(df.fac),2,function(x) round(100*sum(as.numeric(x))/length(x),2)) %>% 
           sort(decreasing = T)
##        PoolQC   MiscFeature         Alley         Fence   FireplaceQu 
##         99.52         96.30         93.77         80.75         47.26 
##    GarageType  GarageFinish    GarageQual    GarageCond  BsmtExposure 
##          5.55          5.55          5.55          5.55          2.60 
##  BsmtFinType2      BsmtQual      BsmtCond  BsmtFinType1    MasVnrType 
##          2.60          2.53          2.53          2.53          0.55 
##    Electrical      MSZoning        Street      LotShape   LandContour 
##          0.07          0.00          0.00          0.00          0.00 
##     Utilities     LotConfig     LandSlope  Neighborhood    Condition1 
##          0.00          0.00          0.00          0.00          0.00 
##    Condition2      BldgType    HouseStyle     RoofStyle      RoofMatl 
##          0.00          0.00          0.00          0.00          0.00 
##   Exterior1st   Exterior2nd     ExterQual     ExterCond    Foundation 
##          0.00          0.00          0.00          0.00          0.00 
##       Heating     HeatingQC    CentralAir   KitchenQual    Functional 
##          0.00          0.00          0.00          0.00          0.00 
##    PavedDrive      SaleType SaleCondition 
##          0.00          0.00          0.00

Note that the variables PoolQC, MiscFeature, Alley, Fence, and FireplaceQu leave us no alternative but to remove them from the data set: with this much missingness, imputation would badly propagate error, given the uncertainties associated with imputation methods.

6.5 String variable selection:

  • PoolQC
  • MiscFeature
  • Alley
  • Fence
  • FireplaceQu

df.fac <- df.fac %>% 
            select(-PoolQC,-MiscFeature,
                   -Alley,-Fence,-FireplaceQu)
df.fac

6.6 Pre-selection of the categorical variables

set.seed(2510)
boruta_fac <- Boruta(SalePrice ~ ., data=na.omit(data.frame(SalePrice = df.int[,'SalePrice'],df.fac)), doTrace=2)  # perform Boruta search
cols.fac <- names(boruta_fac$finalDecision[boruta_fac$finalDecision %in% c("Confirmed", "Tentative")])  

6.7 Variable importance plot

plot(boruta_fac, cex.axis=.7, las=2, xlab="", main="Variable Importance")

7 Selected variables in df.fac

df.fac <- df.fac[,cols.fac]
apply(is.na(df.fac),2,function(x) round(100*sum(as.numeric(x))/length(x),2)) %>% 
           sort(decreasing = T)
##    GarageType  GarageFinish    GarageCond  BsmtExposure      BsmtQual 
##          5.55          5.55          5.55          2.60          2.53 
##      BsmtCond  BsmtFinType1    MasVnrType    Electrical      MSZoning 
##          2.53          2.53          0.55          0.07          0.00 
##      LotShape   LandContour     LandSlope  Neighborhood    Condition1 
##          0.00          0.00          0.00          0.00          0.00 
##      BldgType    HouseStyle     RoofStyle      RoofMatl   Exterior1st 
##          0.00          0.00          0.00          0.00          0.00 
##   Exterior2nd     ExterQual     ExterCond    Foundation     HeatingQC 
##          0.00          0.00          0.00          0.00          0.00 
##    CentralAir   KitchenQual    Functional    PavedDrive SaleCondition 
##          0.00          0.00          0.00          0.00          0.00

Since we still have null values, we turn to caret to impute these categories.

To build a model that imputes each category, we remove all variables that have any percentage of null values and keep only one of them, the imputation target, in each model.
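The nine per-target data frames built below could also be generated in a loop (a sketch only; it produces the same column subsets as the explicit `select()` calls that follow):

```r
# Sketch: for each NA-bearing variable, keep it and drop the other
# NA-bearing variables, so each data frame has exactly one imputation target.
na.vars <- c('GarageType','GarageFinish','GarageCond','BsmtExposure',
             'BsmtQual','BsmtCond','BsmtFinType1','MasVnrType','Electrical')
dfs <- lapply(na.vars, function(v) df.fac[, !(names(df.fac) %in% setdiff(na.vars, v))])
names(dfs) <- na.vars   # e.g. dfs$GarageType corresponds to df.GarageType below
```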

7.1 Building the data frames

df.GarageType <- df.fac %>% select(-GarageFinish,-GarageCond,
                                   -BsmtExposure,-BsmtQual,
                                   -BsmtCond,-BsmtFinType1,
                                   -MasVnrType,-Electrical)
df.GarageFinish <- df.fac %>% select(-GarageType,-GarageCond,
                                     -BsmtExposure,-BsmtQual,
                                     -BsmtCond,-BsmtFinType1,
                                     -MasVnrType,-Electrical)
df.GarageCond <- df.fac %>% select(-GarageType,-GarageFinish,
                                   -BsmtExposure,-BsmtQual,
                                   -BsmtCond,-BsmtFinType1,
                                   -MasVnrType,-Electrical)
df.BsmtExposure <- df.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtQual,
                                     -BsmtCond,-BsmtFinType1,
                                     -MasVnrType,-Electrical)
df.BsmtQual <- df.fac %>% select(-GarageType,-GarageFinish,
                                  -GarageCond,-BsmtExposure,
                                  -BsmtCond,-BsmtFinType1,
                                  -MasVnrType,-Electrical)
df.BsmtCond <- df.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtFinType1,
                                     -MasVnrType,-Electrical)
df.BsmtFinType1 <- df.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtCond,
                                     -MasVnrType,-Electrical)
df.MasVnrType <- df.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtCond,
                                     -BsmtFinType1,-Electrical)
df.Electrical <- df.fac %>% select(-GarageType,-GarageFinish,
                                   -GarageCond,-BsmtExposure,
                                   -BsmtQual,-BsmtCond,
                                   -BsmtFinType1,-MasVnrType)

7.2 Using Random Forest

set.seed(12345)
fitControl <- trainControl(method="cv", 
                           number=3, 
                           savePredictions = 'final',
                           classProbs= F, 
                           summaryFunction = multiClassSummary)
set.seed(12345)
# Train the model using rf
rf.GarageType = train(GarageType ~ ., data= na.omit(df.GarageType), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.GarageFinish = train(GarageFinish ~ ., data= na.omit(df.GarageFinish), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.GarageCond = train(GarageCond ~ ., data= na.omit(df.GarageCond), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.BsmtExposure = train(BsmtExposure ~ ., data= na.omit(df.BsmtExposure), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.BsmtQual = train(BsmtQual ~ ., data= na.omit(df.BsmtQual), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.BsmtCond = train(BsmtCond ~ ., data= na.omit(df.BsmtCond), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.BsmtFinType1 = train(BsmtFinType1 ~ ., data= na.omit(df.BsmtFinType1), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.MasVnrType = train(MasVnrType ~ ., data= na.omit(df.MasVnrType), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.Electrical = train(Electrical ~ ., data= na.omit(df.Electrical), 
                                      tuneLength=5,trControl = fitControl,method='rf')

The function below automates the imputation of categories into the null values of each variable.

f.pred <- function(fac, df.var, rf.model, var){
  # rows where the target variable is missing, without the target column
  new.df <- df.var[is.na(df.var[,var]), !(names(df.var) %in% var)]
  pred_rf <- predict(rf.model, newdata = new.df)
  fac[is.na(fac[,var]), var] <- pred_rf
  return(fac)   # return the updated data frame, not the global df.fac
}

8 Putting everything into lists

list.rfvars <- list(rf.GarageType,rf.GarageFinish,
                    rf.GarageCond,rf.BsmtExposure,
                    rf.BsmtQual,rf.BsmtCond,
                    rf.BsmtFinType1,rf.MasVnrType,
                    rf.Electrical)

list.dfvars <- list(df.GarageType,df.GarageFinish,
                    df.GarageCond,df.BsmtExposure,
                    df.BsmtQual,df.BsmtCond,
                    df.BsmtFinType1,df.MasVnrType,
                    df.Electrical)

vars <- c('GarageType','GarageFinish',
          'GarageCond','BsmtExposure',
          'BsmtQual','BsmtCond',
          'BsmtFinType1','MasVnrType',
          'Electrical')

Imputing the categorical values.

for(j in 1:9) df.fac <- f.pred(df.fac,list.dfvars[[j]],list.rfvars[[j]],vars[j])
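After the loop, it is worth confirming that no null values remain, using the same check applied earlier to the integer variables:

```r
# Sanity check: should return FALSE once every target has been imputed
anyNA(df.fac)
```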

9 Joining the data frames

Joining the integer and string data.

df.train <- bind_cols(df.int,df.fac)
df.train

Our df.train is now clean and ready to be explored.

10 Test data

We will apply to the test data the same transformations used on the training data.

df.test <- data.table::fread('../dados/test.csv', 
                             sep=",", 
                             showProgress = FALSE)  %>% 
            data.frame(stringsAsFactors = F)
df.test

11 Splitting the test data set by type

11.1 Integer data

test.int <- df.test[,unlist(lapply(df.test,class)) %in% "integer"]
test.int

We will also impute data in the test.int sample.

apply(is.na(test.int),2,function(x) round(100*sum(as.numeric(x))/length(x),2)) %>% 
           sort(decreasing = T)
##   LotFrontage   GarageYrBlt    MasVnrArea  BsmtFullBath  BsmtHalfBath 
##         15.56          5.35          1.03          0.14          0.14 
##    BsmtFinSF1    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF    GarageCars 
##          0.07          0.07          0.07          0.07          0.07 
##    GarageArea            Id    MSSubClass       LotArea   OverallQual 
##          0.07          0.00          0.00          0.00          0.00 
##   OverallCond     YearBuilt  YearRemodAdd     X1stFlrSF     X2ndFlrSF 
##          0.00          0.00          0.00          0.00          0.00 
##  LowQualFinSF     GrLivArea      FullBath      HalfBath  BedroomAbvGr 
##          0.00          0.00          0.00          0.00          0.00 
##  KitchenAbvGr  TotRmsAbvGrd    Fireplaces    WoodDeckSF   OpenPorchSF 
##          0.00          0.00          0.00          0.00          0.00 
## EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea       MiscVal 
##          0.00          0.00          0.00          0.00          0.00 
##        MoSold        YrSold 
##          0.00          0.00

12 Selecting the variables

Removing the same variables as in the training set.

test.int <- test.int[,c('Id',cols.int)] %>%
                        select(-LotFrontage)
preProcess_missingdata_model <- preProcess(test.int[,-1], method='knnImpute')
preProcess_missingdata_model
## Created from 1365 samples and 27 variables
## 
## Pre-processing:
##   - centered (27)
##   - ignored (0)
##   - 5 nearest neighbor imputation (27)
##   - scaled (27)

We will now use this model to predict the missing values in test.int.

test.int1 <- predict(preProcess_missingdata_model, newdata = test.int[,-1] )
anyNA(test.int1)
## [1] FALSE
test.int <- data.frame(Id = test.int$Id,test.int1)
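Note that the knnImpute model above was refitted on the test set itself. An alternative, sketched here, is to reuse the model fitted on the training data so that both sets share the same centering and scaling; `preProcess_train_model` is a hypothetical name for a training-set model kept under its own variable rather than overwritten:

```r
# Sketch (assumption): the training-set preprocessing model was saved
# under a distinct name instead of being overwritten, e.g.
# preProcess_train_model <- preProcess(df.int[,-c(1,2)], method='knnImpute')
test.int1 <- predict(preProcess_train_model, newdata = test.int[,-1])
```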

12.1 Transforming string data into categorical

test.fac <- df.test[,unlist(lapply(df.test,class)) %in% "character"] %>%
                           apply(2,as.factor) %>% data.frame

Removing the same columns as in the training set.

test.fac <- test.fac %>% select(-PoolQC,-MiscFeature,
                                -Alley,-Fence,-FireplaceQu)

test.fac <- test.fac[,cols.fac]

Checking the percentage of null values.

apply(is.na(test.fac),2,function(x) round(100*sum(as.numeric(x))/length(x),2)) %>% 
           sort(decreasing = T)
##  GarageFinish    GarageCond    GarageType      BsmtCond      BsmtQual 
##          5.35          5.35          5.21          3.08          3.02 
##  BsmtExposure  BsmtFinType1    MasVnrType      MSZoning    Functional 
##          3.02          2.88          1.10          0.27          0.14 
##   Exterior1st   Exterior2nd   KitchenQual      LotShape   LandContour 
##          0.07          0.07          0.07          0.00          0.00 
##     LandSlope  Neighborhood    Condition1      BldgType    HouseStyle 
##          0.00          0.00          0.00          0.00          0.00 
##     RoofStyle      RoofMatl     ExterQual     ExterCond    Foundation 
##          0.00          0.00          0.00          0.00          0.00 
##     HeatingQC    CentralAir    Electrical    PavedDrive SaleCondition 
##          0.00          0.00          0.00          0.00          0.00

Once again we turn to caret to impute these categories.

To build a model that imputes each category, we remove all variables that have any percentage of null values and keep only one of them, the imputation target, in each model.

12.2 Building the data frames

df.GarageType <- test.fac %>% select(-GarageFinish,-GarageCond,
                                     -BsmtExposure,-BsmtQual,
                                     -BsmtCond,-BsmtFinType1,
                                     -MasVnrType,
                                     -MSZoning,-Functional,
                                     -Exterior1st,-Exterior2nd,
                                     -KitchenQual)
df.GarageFinish <- test.fac %>% select(-GarageType,-GarageCond,
                                     -BsmtExposure,-BsmtQual,
                                     -BsmtCond,-BsmtFinType1,
                                     -MasVnrType,
                                     -MSZoning,-Functional,
                                     -Exterior1st,-Exterior2nd,
                                     -KitchenQual)
df.GarageCond <- test.fac %>% select(-GarageType,-GarageFinish,
                                   -BsmtExposure,-BsmtQual,
                                   -BsmtCond,-BsmtFinType1,
                                   -MasVnrType,
                                   -MSZoning,-Functional,
                                   -Exterior1st,-Exterior2nd,
                                   -KitchenQual)
df.BsmtExposure <- test.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtQual,
                                     -BsmtCond,-BsmtFinType1,
                                     -MasVnrType,
                                     -MSZoning,-Functional,
                                     -Exterior1st,-Exterior2nd,
                                     -KitchenQual)
df.BsmtQual <- test.fac %>% select(-GarageType,-GarageFinish,
                                  -GarageCond,-BsmtExposure,
                                  -BsmtCond,-BsmtFinType1,
                                  -MasVnrType,
                                  -MSZoning,-Functional,
                                  -Exterior1st,-Exterior2nd,
                                  -KitchenQual)
df.BsmtCond <- test.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtFinType1,
                                     -MasVnrType,
                                     -MSZoning,-Functional,
                                     -Exterior1st,-Exterior2nd,
                                     -KitchenQual)
df.BsmtFinType1 <- test.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtCond,
                                     -MasVnrType,
                                     -MSZoning,-Functional,
                                     -Exterior1st,-Exterior2nd,
                                     -KitchenQual)
df.MasVnrType <- test.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtCond,
                                     -BsmtFinType1,
                                     -MSZoning,-Functional,
                                     -Exterior1st,-Exterior2nd,
                                     -KitchenQual)
df.MSZoning <- test.fac %>% select(-GarageType,-GarageFinish,
                                 -GarageCond,-BsmtExposure,
                                 -BsmtQual,-BsmtCond,
                                 -BsmtFinType1,-MasVnrType,
                                 -Functional,
                                 -Exterior1st,-Exterior2nd,
                                 -KitchenQual)
df.Functional <- test.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtCond,
                                     -BsmtFinType1,-MasVnrType,
                                     -MSZoning,
                                     -Exterior1st,-Exterior2nd,
                                     -KitchenQual)
df.Exterior1st <- test.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtCond,
                                     -BsmtFinType1,-MasVnrType,
                                     -MSZoning,
                                     -Functional,-Exterior2nd,
                                     -KitchenQual)
df.Exterior2nd <- test.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtCond,
                                     -BsmtFinType1,-MasVnrType,
                                     -MSZoning,-Functional,
                                     -Exterior1st,-KitchenQual)
df.KitchenQual <- test.fac %>% select(-GarageType,-GarageFinish,
                                     -GarageCond,-BsmtExposure,
                                     -BsmtQual,-BsmtCond,
                                     -BsmtFinType1,-MasVnrType,
                                     -MSZoning,
                                     -Functional,-Exterior1st,
                                     -Exterior2nd)

12.3 Using Random Forest

set.seed(12345)
fitControl <- trainControl(method="cv", 
                           number=3, 
                           savePredictions = 'final',
                           classProbs= F, 
                           summaryFunction = multiClassSummary)
set.seed(12345)
# Train the model using rf
rf.GarageType = train(GarageType ~ ., data= na.omit(df.GarageType), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.GarageFinish = train(GarageFinish ~ ., data= na.omit(df.GarageFinish), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.GarageCond = train(GarageCond ~ ., data= na.omit(df.GarageCond), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.BsmtExposure = train(BsmtExposure ~ ., data= na.omit(df.BsmtExposure), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.BsmtQual = train(BsmtQual ~ ., data= na.omit(df.BsmtQual), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.BsmtCond = train(BsmtCond ~ ., data= na.omit(df.BsmtCond), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.BsmtFinType1 = train(BsmtFinType1 ~ ., data= na.omit(df.BsmtFinType1), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.MasVnrType = train(MasVnrType ~ ., data= na.omit(df.MasVnrType), 
                                      tuneLength=5,trControl = fitControl,method='rf')


rf.MSZoning = train(MSZoning ~ ., data= na.omit(df.MSZoning), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.Functional = train(Functional ~ ., data= na.omit(df.Functional), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.Exterior1st = train(Exterior1st ~ ., data= na.omit(df.Exterior1st), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.Exterior2nd = train(Exterior2nd ~ ., data= na.omit(df.Exterior2nd), 
                                      tuneLength=5,trControl = fitControl,method='rf')

rf.KitchenQual = train(KitchenQual ~ ., data= na.omit(df.KitchenQual), 
                                      tuneLength=5,trControl = fitControl,method='rf')

The function below automates the imputation of categories into the null values of each variable.

f.pred <- function(fac, df.var, rf.model, var){
  # rows where the target variable is missing, without the target column
  new.df <- df.var[is.na(df.var[,var]), !(names(df.var) %in% var)]
  pred_rf <- predict(rf.model, newdata = new.df)
  fac[is.na(fac[,var]), var] <- pred_rf
  return(fac)
}

13 Putting everything into lists

list.rfvars <- list(rf.GarageType,rf.GarageFinish,
                    rf.GarageCond,rf.BsmtExposure,
                    rf.BsmtQual,rf.BsmtCond,
                    rf.BsmtFinType1,rf.MasVnrType,
                    rf.MSZoning,rf.Functional,
                    rf.Exterior1st,rf.Exterior2nd,rf.KitchenQual)

list.dfvars <- list(df.GarageType,df.GarageFinish,
                    df.GarageCond,df.BsmtExposure,
                    df.BsmtQual,df.BsmtCond,
                    df.BsmtFinType1,df.MasVnrType,
                    df.MSZoning,df.Functional,
                    df.Exterior1st,df.Exterior2nd,df.KitchenQual)

vars <- c('GarageType','GarageFinish',
          'GarageCond','BsmtExposure',
          'BsmtQual','BsmtCond',
          'BsmtFinType1','MasVnrType',
          'MSZoning','Functional','Exterior1st',
          'Exterior2nd','KitchenQual')

Imputing the categorical values.

for(j in 1:length(vars)) test.fac <- f.pred(test.fac,list.dfvars[[j]],list.rfvars[[j]],vars[j])

14 Joining the data frames

Joining the integer and categorical data.

df.test <- bind_cols(test.int,test.fac)
df.test

Now, with 1319 rows, our df.test is clean and ready to be explored.

15 Exporting the clean data

write.csv(df.train,'../outputs/df.train.csv')
write.csv(df.test,'../outputs/df.test.csv')

Sérgio Carvalho

March 21, 2019